NTSB Aviation Accident Database Exploration by Hamza Bendemra

Introduction

Our dataset features 81,013 observations of 31 variables (shown below) which are related to aviation accidents recorded. Each accident is provided an “Event.Id”. We can also see that variables provided information on a variety of topics including the date and location of the accident, the model and type aircraft, information on the sustained injuries to passengers and to the aircraft, and the reported weather conditions at the time.

## [1] 81013    31
##  [1] "Event.Id"               "Investigation.Type"    
##  [3] "Accident.Number"        "Event.Date"            
##  [5] "Location"               "Country"               
##  [7] "Latitude"               "Longitude"             
##  [9] "Airport.Code"           "Airport.Name"          
## [11] "Injury.Severity"        "Aircraft.Damage"       
## [13] "Aircraft.Category"      "Registration.Number"   
## [15] "Make"                   "Model"                 
## [17] "Amateur.Built"          "Number.of.Engines"     
## [19] "Engine.Type"            "FAR.Description"       
## [21] "Schedule"               "Purpose.of.Flight"     
## [23] "Air.Carrier"            "Total.Fatal.Injuries"  
## [25] "Total.Serious.Injuries" "Total.Minor.Injuries"  
## [27] "Total.Uninjured"        "Weather.Condition"     
## [29] "Broad.Phase.of.Flight"  "Report.Status"         
## [31] "Publication.Date"

Let’s have a look at the first 5 rows of the dataset for an initial look at how the dataframe is organised.

##         Event.Id Investigation.Type Accident.Number Event.Date
## 1 20171230X01728           Accident      WPR18LA058 2017-12-29
## 2 20171229X23639           Accident      GAA18CA094 2017-12-28
## 3 20180102X10253           Accident      CEN18LA064 2017-12-28
## 4 20171227X12320           Accident      GAA18CA096 2017-12-27
## 5 20171228X35517           Accident      CEN18LA062 2017-12-27
##             Location       Country Latitude  Longitude Airport.Code
## 1 San Bernardino, CA United States 34.09167 -117.24778          SBD
## 2      Spearfish, SD United States 44.48472 -103.78805          SPF
## 3       Paulding, OH United States 41.18889  -84.70806             
## 4     Upper Lake, CA United States 39.44306 -122.96028          1Q5
## 5  Michigan City, IN United States 41.69528   86.82556          MGC
##                     Airport.Name Injury.Severity Aircraft.Damage
## 1            SAN BERNARDINO INTL       Non-Fatal     Substantial
## 2    BLACK HILLS-CLYDE ICE FIELD       Non-Fatal     Substantial
## 3                            N/A       Non-Fatal     Substantial
## 4                                    Unavailable                
## 5 Michigan City Municipal Airpor       Non-Fatal     Substantial
##   Aircraft.Category Registration.Number                     Make
## 1          Airplane              N222AT                   CESSNA
## 2          Airplane               N11KX                    PIPER
## 3          Airplane                N4JW PIPISTREL DOO AJDOVSCINA
## 4          Airplane              N3945A               AEROPRO CZ
## 5          Airplane              N525KT                   CESSNA
##        Model Amateur.Built Number.of.Engines   Engine.Type
## 1        210            No                 1 Reciprocating
## 2 PA 32-301T            No                 1              
## 3   VIRUS SW                               1 Reciprocating
## 4    EUROFOX            No                 1              
## 5       525A            No                NA              
##             FAR.Description Schedule Purpose.of.Flight Air.Carrier
## 1 Part 91: General Aviation                   Personal            
## 2 Part 91: General Aviation                   Personal            
## 3 Part 91: General Aviation                   Personal            
## 4 Part 91: General Aviation                   Personal            
## 5 Part 91: General Aviation                Positioning            
##   Total.Fatal.Injuries Total.Serious.Injuries Total.Minor.Injuries
## 1                   NA                     NA                   NA
## 2                   NA                     NA                   NA
## 3                   NA                     NA                   NA
## 4                   NA                     NA                   NA
## 5                   NA                     NA                   NA
##   Total.Uninjured Weather.Condition Broad.Phase.of.Flight Report.Status
## 1               1               VMC               LANDING   Preliminary
## 2               1               VMC                         Preliminary
## 3               1               VMC                CRUISE   Preliminary
## 4              NA                                           Preliminary
## 5               2               VMC               LANDING   Preliminary
##   Publication.Date
## 1       2018-01-08
## 2       2018-01-09
## 3       2018-01-04
## 4       2018-01-08
## 5       2018-01-08

The structure of the dataframe (shown below) shows that there is a total of 81,013 observations of 31 variables in this dataset. The variable types include factorial, numeric, and integer variables.

## 'data.frame':    81013 obs. of  31 variables:
##  $ Event.Id              : Factor w/ 79835 levels "20001204X00000",..: 79832 79831 79833 79828 79830 79829 79827 79824 79826 79823 ...
##  $ Investigation.Type    : Factor w/ 2 levels "Accident","Incident": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Accident.Number       : Factor w/ 81013 levels "ANC00FA018","ANC00FA024",..: 81010 45832 16696 45833 16695 80965 81009 37058 37044 16665 ...
##  $ Event.Date            : Factor w/ 13005 levels "1948-10-24","1962-07-19",..: 13005 13004 13004 13003 13003 13002 13001 13000 13000 13000 ...
##  $ Location              : Factor w/ 25708 levels "",", Gabon",", Ireland",..: 20287 21673 17753 23609 14620 20476 17932 18859 23566 1488 ...
##  $ Country               : Factor w/ 178 levels "","Afghanistan",..: 169 169 169 169 169 169 169 164 169 169 ...
##  $ Latitude              : num  34.1 44.5 41.2 39.4 41.7 ...
##  $ Longitude             : num  -117.2 -103.8 -84.7 -123 86.8 ...
##  $ Airport.Code          : Factor w/ 9788 levels "","-","---",".",..: 8485 8733 1 666 6879 8701 1 1 1 3498 ...
##  $ Airport.Name          : Factor w/ 23147 levels "","---","??vora Airport",..: 18217 1874 14356 1 13525 10323 14356 14356 14356 1320 ...
##  $ Injury.Severity       : Factor w/ 125 levels "Fatal(1)","Fatal(10)",..: 124 124 124 125 124 125 124 49 124 88 ...
##  $ Aircraft.Damage       : Factor w/ 4 levels "","Destroyed",..: 4 4 4 1 4 4 4 2 4 2 ...
##  $ Aircraft.Category     : Factor w/ 14 levels "","Airplane",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Registration.Number   : Factor w/ 70233 levels "","00SLB","1227G",..: 11326 2645 32919 24570 35350 47841 45460 1 64577 13638 ...
##  $ Make                  : Factor w/ 7728 levels "","107.5 Flying Corporation",..: 1303 5386 5405 103 1303 1303 1303 1303 1303 1303 ...
##  $ Model                 : Factor w/ 11599 levels "","-269C","-737-222",..: 616 7912 11241 5090 1225 378 194 254 254 912 ...
##  $ Amateur.Built         : Factor w/ 3 levels "","No","Yes": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Number.of.Engines     : int  1 1 1 1 NA 1 1 1 1 2 ...
##  $ Engine.Type           : Factor w/ 14 levels "","Electric",..: 8 1 8 1 1 1 8 8 8 8 ...
##  $ FAR.Description       : Factor w/ 19 levels "","Armed Forces",..: 15 15 15 15 15 15 15 4 15 15 ...
##  $ Schedule              : Factor w/ 4 levels "","NSCH","SCHD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Purpose.of.Flight     : Factor w/ 23 levels "","Aerial Application",..: 16 16 16 16 17 16 16 23 16 16 ...
##  $ Air.Carrier           : Factor w/ 2930 levels "","(DBA: [EMS])",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Total.Fatal.Injuries  : int  NA NA NA NA NA NA NA 2 NA 5 ...
##  $ Total.Serious.Injuries: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Minor.Injuries  : int  NA NA NA NA NA NA NA NA 1 NA ...
##  $ Total.Uninjured       : int  1 1 1 NA 2 NA 2 NA NA NA ...
##  $ Weather.Condition     : Factor w/ 4 levels "","IMC","UNK",..: 4 4 4 1 4 1 4 1 4 2 ...
##  $ Broad.Phase.of.Flight : Factor w/ 13 levels "","APPROACH",..: 7 1 4 1 7 1 1 11 8 11 ...
##  $ Report.Status         : Factor w/ 4 levels "Factual","Foreign",..: 3 3 3 3 3 3 3 2 3 3 ...
##  $ Publication.Date      : Factor w/ 3709 levels "","1980-04-16",..: 3705 3706 3704 3705 3705 3703 3703 1 3705 3702 ...

Data Wrangling

Since this a United States database from the United States, it is highly likely that the majority of the reports accidents are from the US. Let’s check that calculating the percentage of reported accidents in this database from the US.

## [1] 0.9404417 1.0000000

As expected, the majority of accidents (over 94%) in this databased are from the US. Hence, we will be focusing on the accidents that took place in the US in this analysis.

As shown below, the new dataframe now features 76,188 observations.

## [1] 76188    31

We will further tailor our dataframe for this analysis by focusing on 7 variables of interest, out of the 31 available variables. Let’s focus on: (1) the date, (2) the total fatal injuries, (3) the engine type, (4) the weather condition, (5) the broad phase of flight, (6) Latitude, and (7) Longitude

Let’s check that our new dataframe ‘pf_study’ features all 5 variables.

## [1] "Event.Date"            "Total.Fatal.Injuries"  "Engine.Type"          
## [4] "Weather.Condition"     "Broad.Phase.of.Flight" "Latitude"             
## [7] "Longitude"
## [1] 76188     7

As shown above, the dataframe we will be using for this study features 76,188 observations of 5 variables. Below is a summary of the dataframe.

Looking at the summary output we can get a quick overview of the distribution for each of our variables.

##       Event.Date    Total.Fatal.Injuries        Engine.Type   
##  1982-05-16:   25   Min.   :  0.000      Reciprocating:64835  
##  1984-06-30:   25   1st Qu.:  0.000      Turbo Shaft  : 3127  
##  1983-06-05:   24   Median :  0.000      Turbo Prop   : 2974  
##  1983-08-05:   24   Mean   :  0.535      Turbo Fan    : 1967  
##  1984-08-25:   24   3rd Qu.:  1.000      Unknown      : 1387  
##  1986-05-17:   24   Max.   :265.000                   : 1227  
##  (Other)   :76042   NA's   :22821        (Other)      :  671  
##  Weather.Condition Broad.Phase.of.Flight    Latitude     
##     :  303         LANDING    :19513     Min.   :-61.88  
##  IMC: 5369         TAKEOFF    :15370     1st Qu.: 33.69  
##  UNK:  597         CRUISE     :10548     Median : 38.35  
##  VMC:69919         MANEUVERING: 9880     Mean   : 38.96  
##                    APPROACH   : 7789     3rd Qu.: 42.48  
##                               : 2736     Max.   : 87.15  
##                    (Other)    :10352     NA's   :50372   
##    Longitude      
##  Min.   :-175.37  
##  1st Qu.:-115.99  
##  Median : -95.66  
##  Mean   : -98.19  
##  3rd Qu.: -82.62  
##  Max.   : 174.33  
##  NA's   :50382

The variable related to the observation’s event date will need some further data wrangling to be able to extract meanigful information to draw conclusions from. For instance, it would be useful to break the observation’s event date by day, month, and year to investigate if there are any correlations between number of accidents/incidents and particular periods within a year.

Also, looking at the Total Fatalities, we can see that the 3rd Qu. is 1.000, the Max is 265. Clearly, this distribution will be heavily skewed. A better way of displaying this data is by grouping number of fatalities in buckets, as shown below. This will give us a better reprsentation of the distribution of fatalities across the observations in the dataset.

## 
##     [0,1]     (1,2]     (2,3]     (3,4]     (4,5]    (5,10]   (10,20] 
##     46666      4136      1227       822       215       232        33 
##   (20,50]  (50,100] (100,150] (150,200] (200,250] (250,300]      <NA> 
##        17        10         4         2         2         1     22821

In terms of the weather conditions, let’s replace blank entries in the weather conditions with “unknown”.

Similarly, let’s replace blank entries in the Phases of Flight with “unknown”.

Univariate Plots Section

In this section, we will create univariate plots for the 5 variables of interest.

Event Date

Let’s have an initial look at the ‘Event.Date’ variable.

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1948-10-24" "1987-11-30" "1995-12-17" "1997-03-06" "2005-08-05" 
##         Max. 
## "2017-12-29"

By Year

Now, let’s plots frequency histograms for the year, month, and weekday of accidents/incidents in the dataset.

It seems the majority of the observations in the dataframe are from after the early 1980s onwards. Let’s perform a quick check.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1948    1987    1995    1997    2005    2017

Let’s generate a plot from 1980 to 2017.

According to our plot, the number of accidents/incidents has substantially decreseased between 1982 and 2017 from approx. 3400 observations to approx. 1600 observations.

By Month of the Year

Next, let’s look at observations distribution by months of the year.

According to our plot, the highest frequency of accidents/incidents in a given year take place during northern hemisphere summer time (Jun-Jul-Aug).

By Day of the Week

And finally, let’s look at the observations distrubution by day of the week.

According to our plot, the highest frequency of accidents/incidents in a given week take place during the weekend (Sat-Sun).

Total Fatal Injuries

Our next variable of interest relates to the Total Fatal Injuries for each observation in the dataset.This is quantified by the number of people injured for each recorded observation.

Let’s first have a look at the different levels in the factorial variable describing the severity of injury in our dataframe.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.535   1.000 265.000   22821

Let’s generate a plot.

We can see that the majority of recorded accidents/incidents have a low number of fatalities while some outliers are displaying large numbers of fatalities. A line plot makes that even clearer.

We’ve grouped the number of fatalities in the buckets shown in the plot below.

As previously stated, the charts above show that the bulk of accidents resulting in fatalities of less than 20.

Engine Types

Next, we look at the engine types recorded in our observations. Let’s have a look at the different engin types distribution in our dataframe.

##  [1] ""                 "Electric"         "Hybrid Rocket"   
##  [4] "None"             "REC, ELEC"        "REC, TJ, REC, TJ"
##  [7] "REC, TJ, TJ"      "Reciprocating"    "TJ, REC, REC, TJ"
## [10] "Turbo Fan"        "Turbo Jet"        "Turbo Prop"      
## [13] "Turbo Shaft"      "Unknown"
##                          Electric    Hybrid Rocket             None 
##             1227                4                1               13 
##        REC, ELEC REC, TJ, REC, TJ      REC, TJ, TJ    Reciprocating 
##                1                1                2            64835 
## TJ, REC, REC, TJ        Turbo Fan        Turbo Jet       Turbo Prop 
##                1             1967              648             2974 
##      Turbo Shaft          Unknown 
##             3127             1387

Let’s generate a plot. We’ve abbreviate engine type names to improve labelling of the x-axis.

A logarithmic scale would be suited also.

According to our charts, the bulk of engine types in the reported accidents/incidents are Reciprocal engine types.

Weather Conditions

Next, we look at the weather conditions recorded in our observations. Let’s have a look at the different weather conditions in our dataframe.

##   VMC   IMC   UNK  NA's 
## 69919  5369   597   303
##   VMC   IMC   UNK  NA's 
## 69919  5369   597   303

VMC stands for visual meteorological conditions and IMC stands for instrument meteorological conditions. VMC means that conditions are such that pilots have sufficient visibility to fly the aircraft maintaining visual separation from terrain and other aircraft.

IMC means weather conditions require pilots to fly primarily by reference to instruments.

Let’s generate a plot of the weather conditions distribution in our dataframe.

According to our plot, the bulk of accidents/incidents take place during VMC weather conditions.

Broad Phases of Flight

Next, we look at the weather conditions recorded in our observations. Let’s have a look at the different weather conditions in our dataframe.

##    Length     Class      Mode 
##     76188 character character

According to our plot, the bulk of accidents/incidents take place during landing or take-off.

Latitude

Next, we look at the latitudes recorded in our observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -61.88   33.69   38.35   38.96   42.48   87.15   50372

Longitude

Next, we look at the latitudes recorded in our observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -175.37 -115.99  -95.66  -98.19  -82.62  174.33   50382

According to our plots, the bulk of accidents/incidents are taking place within a range of latitude of [20,50] and longitude of [-170,-70] degrees.

Univariate Analysis

What is the structure of your dataset?

The initial dataset featured 81,013 observations of 31 variables related to aviation accidents recorded. From this dataframe, a dataframe for this study was extracted. It features 76,188 observations of 5 variables of interest to us: (1) Event Date, (2) Injury Severity, (3) Engine Type, (4) Weather Condition, and (5) Broad Phase of Flight when the accident or incident occured.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are the number of fatalities and the other variables in the dataset. I’d like to determine which features are best for predicting the number of fatalities for a given observation of an accident or incident. I suspect that it will be some combination of variables that will allow us to build a predictive model for the amount of fatalities.

Did you create any new variables from existing variables in the dataset?

Yes, I had to create new variables to extract the year, month, and days of the observations. I also created a new variable to cluster fatalities numbers into buckets.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The dataset, even though provided by Kaggle, required a fair amount of data wrangling to make the best use of it. Nothing was unusual but I certainly appreciate the amount of time needed for data wrangling before being able to perfor an EDA on the dataset afer working on this dataset.

Bivariate Plots Section

Let’s look at the relationship between pairs of variables that could show interesting relationship. First a recap of available variables in our dataframe.

##  [1] "Event.Date"                    "Total.Fatal.Injuries"         
##  [3] "Engine.Type"                   "Weather.Condition"            
##  [5] "Broad.Phase.of.Flight"         "Latitude"                     
##  [7] "Longitude"                     "Event.Year"                   
##  [9] "Event.Month"                   "Event.Weekday"                
## [11] "Total.Fatal.Injuries.Brackets"

Engine Types and Total Fatal Injuries

Let’s look at the relationship of Engine Types and Total Fatal Injuries.

We can see that given that the bulk of the distribution has a total fatal injuries under 20, let’s zoom in on that portion of the data.

According to the plot, the bulk of the data for fatalities under 20 is with the engine type Reciprocated. The first plot shows that the Tubo-Fan engine has more outliers with higher number of fatalities than other engins. This is likely due to the use of Turbo-Fan engines use on some large commercial aircraft.

Weather Conditions and Total Fatal Injuries

Let’s look at the relationship of Weather Conditions and Total Fatal Injuries.

Weather conditions do not show a particularly strong relationship with total fatal injuries. The bulk of the distribution is associate with VMC weather conditions. However, that is likely to the fact that the vast majority of flights are flown in VMC conditions.

Phase of Flight and Total Fatal Injuries

Let’s look at the relationship of Phase of Flight and Total Fatal Injuries

The plots show that Take-Off and Apprach are associated with outliers with high number of fatalities. These two phases of flight are often referred to as “critical phases of flight” for that particular reason. They’re high risk.

This boxplot further emphasises the presence of high outliers for take-off and approaches phases of flight. It also indicates that outliers are present for all broad phases of flight.

Event Date and Total Fatal Injuries

Let’s look at the relationship of Event Date and Total Fatal Injuries

By Month of the Year

The distribution is fairly distributed across all months of the year.

The box plot shows also outliers distributed fairly across all months of the year. However, norther winter months show higher fatalities for the upper bound of the interquartile range compared to the northern summer months.

By Day of the Week

The distribution looks to be fairly distributed across all days of the week.

Again, the box plot shows that the distribution is fairly distributed across all days of the week.

Longitude and Latitude and Total Fatal Injuries

Both latitude and longitude plots are biased toward the United States coordinates due to our initial filtering of the data to focus on US observations.

Broad Phase of Flight and Weather Conditions

The plot indicates that there is a higher frequency of recorded observations for IMC flying conditions for “cruise” and the “approach” compared to the rest of the phases of flight flown in IMC. In VMC conditions, the distribution looks fairly evenly distributed across all phases of flight.

Broad Phase of Flight and Event Month

The plot indicates that there is a higher frequency of recorded observations for nothern summer months during Landing and Take-off. Across all months, the heat map also shows that the Take-off and Landing register the highest number of observations.

Longitude and Latitude

These plots are essently maps of the US and clearly shows that the coastal states are more heavily impacted that the Mid-western states and most of Alaska. This can be explained by the volume of flight to/from destinations across the US.

Latitude and Longitude Correlations to Total Fatalities

## $estimate
##         cor 
## 0.001698444
## $estimate
##         cor 
## -0.02616494

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
Regarding the relationship of Engine Types and Total Fatal Injuries, we can see

some outliers particularly for the engine type TrbF, which is the Turbo-Fan engine type. Reciprocal, Turbo-Prop and Turbo-Shaft show the largest number of fatalities.

Regarding of the Weather Conditions and Total Fatal Injuries, the VMC and IMC conditions show the highest numbers of fatalities recorded. However, that may be insignificant as commercial aircraft can typically only fly if VMC or IMC conditions are present.

Regarding the relationship Phase of Flight and Total Fatal Injuries, there doesn’t seem to be any particular one phase of flight that jumps out. However, a set of flight phases show the highest numbers of fatalities, namely the approach and take-off.

Regarding the relationship of Event Date and Total Fatal Injuries, outliers with high number of fatalities are more present during northern summer. For the bulk of the distribution, no particular month shows a higher number of fatalities. Similarly for the Weekday, no particular Weekday shows a higher number of fatalities.

Regarding the relationship between Longitude/Latitude and Total Fatal Injuries, we can see that the incidents are clustered within certain ranges. This is the result of our selection to focus only on observations noting “United States” as Country. The plot of Latitude vs Longitude clearly shows the bulk of the observations being in the US. The correlation factors between longitude and latitude and total number of fatalities is a weak positive relationship.

Did you observe any interesting relationships between the other features
The relationship between weather condition and flight phase of recorded

accidents/incidents was interesting. It seems there is higher frequency of recorded observations for certain combinations of weather and what we could qualify higher risk phases of flight. For instance, IMC flying conditions while during “cruise” and the “approach”.

What was the strongest relationship you found?

The stronger relationship I found was between engine types and total number of fatalities; and between weather condition and flight phase of recorded accidents/incidents in the dataframe.

Multivariate Plots Section

Longitude and Latitude by Month

First, let’s look at Latitude vs Longitude of observations by Month of the Year.

The distribution of month across latitude and longitude is fairly spread across the US, with a slightly higher prevalence of observations during the winter in southern states like Florida.

Longitude and Latitude by Weather Conditions

Let’s now look at latitude vs longitude add layer for the weather condition.

Weather condition VMC seems to be quite consistent except for patches of primarily IMC conditions for certain discrete areas.

Broad Phase of Flight and Event Month by Weather Conditions

Let’s now look at the relationship of Broad Phase of Flight vs Month by Weather Conditions.

When looking at the relationship between Broad Phase of Flight vs Month and Weather Condition, we can see that accident/incidents primarily take place during VMC weather condition. However, for certain months of the year such as December an January, IMC conditions are a non-negligible portion of the observations, particularly during Approach and Cruise phases of flight.

Broad Phase of Flight and Event Month by Weather Conditions

Next, let’s look at Total Fatal Injuries vs Broad Phases of Flight by Weather Condition.

In terms of the relationship between Total Fatal Injuries vs Broad Phases of Flight and Weather Condition, the scatter plot clearly indicates that the total number of fatalies shows an increased number with IMC as weather condition and during Cruise and Approach phases of flight. Furthermore, we can see that in general, low fatality count (less then 2) primarily took place in VMC weather conditions.

Total Fatal Injuries and Engine Type by Year

Finally, let’s look at the relationship between Total Fatal Injuries vs Engine Type by Year.

The relationship between Total Fatal Injuries vs Engine Type by Year indicates that certain engines have become prevaent in different time periods, and hence were more likely to be in use during accidents or incidents.

Linear Model

Let’s build a linear model using the variables in our “pf_study” dataframe.

## 
## Calls:
## m1: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type), data = pf_study)
## m2: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition, 
##     data = pf_study)
## m3: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition + 
##     Broad.Phase.of.Flight, data = pf_study)
## m4: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition + 
##     Broad.Phase.of.Flight + Latitude, data = pf_study)
## m5: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition + 
##     Broad.Phase.of.Flight + Latitude + Event.Year, data = pf_study)
## m6: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition + 
##     Broad.Phase.of.Flight + Latitude + Event.Month, data = pf_study)
## m7: lm(formula = I(Total.Fatal.Injuries) ~ I(Engine.Type) + Weather.Condition + 
##     Broad.Phase.of.Flight + Latitude + Event.Weekday, data = pf_study)
## 
## =======================================================================================================================================================
##                                                      m1              m2              m3             m4            m5            m6            m7       
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                        1.873***        1.232***        1.254***      1.442***     -4.185         1.405***      1.347***  
##                                                     (0.207)         (0.209)         (0.223)       (0.161)       (8.589)       (0.177)       (0.171)    
##   I(Engine.Type): Hybrid Rocket                     -0.873          -0.232          -0.455         0.377         0.368         0.431         0.407     
##                                                     (2.788)         (2.713)         (2.724)       (1.441)       (1.441)       (1.441)       (1.439)    
##   I(Engine.Type): REC, TJ, REC, TJ                   1.127           0.732                                                                             
##                                                     (2.788)         (2.713)                                                                            
##   I(Engine.Type): REC, TJ, TJ                        0.127           0.250           0.113         0.477         0.481         0.532         0.418     
##                                                     (1.977)         (1.924)         (1.924)       (0.966)       (0.966)       (0.966)       (0.965)    
##   I(Engine.Type): Reciprocating                     -1.407***       -0.861***       -0.708**       0.209         0.213         0.210         0.221     
##                                                     (0.207)         (0.209)         (0.221)       (0.121)       (0.121)       (0.121)       (0.121)    
##   I(Engine.Type): TJ, REC, REC, TJ                   1.127           1.768           1.908         1.637         1.643         1.681         1.727     
##                                                     (2.788)         (2.713)         (2.712)       (1.360)       (1.360)       (1.361)       (1.359)    
##   I(Engine.Type): Turbo Fan                          0.472*          0.781***        1.018***      0.666**       0.673***      0.667**       0.710***  
##                                                     (0.223)         (0.224)         (0.236)       (0.203)       (0.204)       (0.203)       (0.203)    
##   I(Engine.Type): Turbo Jet                         -1.196***       -0.740**        -0.528*        0.260         0.268         0.247         0.286     
##                                                     (0.243)         (0.243)         (0.254)       (0.231)       (0.231)       (0.231)       (0.231)    
##   I(Engine.Type): Turbo Prop                        -0.749***       -0.329          -0.186         0.793***      0.796***      0.786***      0.826***  
##                                                     (0.216)         (0.218)         (0.230)       (0.142)       (0.142)       (0.142)       (0.142)    
##   I(Engine.Type): Turbo Shaft                       -1.219***       -0.686**        -0.658**       0.505***      0.510***      0.499***      0.540***  
##                                                     (0.216)         (0.217)         (0.229)       (0.148)       (0.148)       (0.148)       (0.148)    
##   I(Engine.Type): Unknown                           -1.711***       -1.092***       -0.907***     -0.490        -0.490        -0.431        -0.460     
##                                                     (0.220)         (0.221)         (0.234)       (0.801)       (0.801)       (0.803)       (0.801)    
##   Weather.Condition: IMC/VMC                                         1.035***        0.918***      0.389***      0.391***      0.383***      0.394***  
##                                                                     (0.041)         (0.043)       (0.055)       (0.056)       (0.056)       (0.055)    
##   Weather.Condition: UNK/VMC                                         0.734***        0.629***     -0.107        -0.128        -0.103        -0.087     
##                                                                     (0.115)         (0.120)       (0.410)       (0.411)       (0.410)       (0.410)    
##   Broad.Phase.of.Flight: CLIMB/APPROACH                                              0.208**      -0.038        -0.036        -0.032        -0.032     
##                                                                                     (0.073)       (0.111)       (0.111)       (0.111)       (0.111)    
##   Broad.Phase.of.Flight: CRUISE/APPROACH                                            -0.079        -0.014        -0.007        -0.019        -0.014     
##                                                                                     (0.046)       (0.073)       (0.073)       (0.073)       (0.072)    
##   Broad.Phase.of.Flight: DESCENT/APPROACH                                           -0.155*       -0.187        -0.187        -0.194        -0.184     
##                                                                                     (0.075)       (0.108)       (0.108)       (0.108)       (0.108)    
##   Broad.Phase.of.Flight: GO-AROUND/APPROACH                                         -0.108         0.077         0.075         0.074         0.075     
##                                                                                     (0.086)       (0.129)       (0.129)       (0.129)       (0.129)    
##   Broad.Phase.of.Flight: LANDING/APPROACH                                           -0.563***     -0.683***     -0.681***     -0.688***     -0.682***  
##                                                                                     (0.044)       (0.114)       (0.114)       (0.114)       (0.114)    
##   Broad.Phase.of.Flight: MANEUVERING/APPROACH                                        0.174***     -0.110        -0.109        -0.114        -0.111     
##                                                                                     (0.047)       (0.064)       (0.064)       (0.064)       (0.064)    
##   Broad.Phase.of.Flight: OTHER/APPROACH                                              0.201        -0.815        -0.819        -0.813        -0.823     
##                                                                                     (0.266)       (0.482)       (0.482)       (0.482)       (0.481)    
##   Broad.Phase.of.Flight: STANDING/APPROACH                                          -0.542***     -0.661**      -0.662**      -0.678**      -0.640**   
##                                                                                     (0.107)       (0.210)       (0.210)       (0.210)       (0.210)    
##   Broad.Phase.of.Flight: TAKEOFF/APPROACH                                           -0.162***     -0.074        -0.076        -0.077        -0.079     
##                                                                                     (0.044)       (0.069)       (0.069)       (0.069)       (0.069)    
##   Broad.Phase.of.Flight: TAXI/APPROACH                                              -0.693***     -0.689*       -0.683*       -0.746*       -0.691*    
##                                                                                     (0.078)       (0.346)       (0.346)       (0.347)       (0.346)    
##   Broad.Phase.of.Flight: UNKNOWN/APPROACH                                            0.606***     -0.100        -0.098        -0.105        -0.113     
##                                                                                     (0.125)       (0.158)       (0.158)       (0.158)       (0.158)    
##   Latitude                                                                                        -0.000        -0.000        -0.000         0.000     
##                                                                                                   (0.003)       (0.003)       (0.003)       (0.003)    
##   Event.Year                                                                                                     0.003                                 
##                                                                                                                 (0.004)                                
##   Event.Month: Feb/Jan                                                                                                         0.312**                 
##                                                                                                                               (0.115)                  
##   Event.Month: Mar/Jan                                                                                                         0.067                   
##                                                                                                                               (0.113)                  
##   Event.Month: Apr/Jan                                                                                                        -0.085                   
##                                                                                                                               (0.109)                  
##   Event.Month: May/Jan                                                                                                         0.016                   
##                                                                                                                               (0.107)                  
##   Event.Month: Jun/Jan                                                                                                        -0.009                   
##                                                                                                                               (0.103)                  
##   Event.Month: Jul/Jan                                                                                                         0.095                   
##                                                                                                                               (0.101)                  
##   Event.Month: Aug/Jan                                                                                                         0.090                   
##                                                                                                                               (0.101)                  
##   Event.Month: Sep/Jan                                                                                                        -0.009                   
##                                                                                                                               (0.105)                  
##   Event.Month: Oct/Jan                                                                                                        -0.022                   
##                                                                                                                               (0.106)                  
##   Event.Month: Nov/Jan                                                                                                        -0.014                   
##                                                                                                                               (0.108)                  
##   Event.Month: Dec/Jan                                                                                                         0.047                   
##                                                                                                                               (0.107)                  
##   Event.Weekday: Tue/Mon                                                                                                                    -0.098     
##                                                                                                                                             (0.084)    
##   Event.Weekday: Wed/Mon                                                                                                                     0.010     
##                                                                                                                                             (0.081)    
##   Event.Weekday: Thu/Mon                                                                                                                     0.093     
##                                                                                                                                             (0.080)    
##   Event.Weekday: Fri/Mon                                                                                                                     0.065     
##                                                                                                                                             (0.078)    
##   Event.Weekday: Sat/Mon                                                                                                                     0.107     
##                                                                                                                                             (0.075)    
##   Event.Weekday: Sun/Mon                                                                                                                     0.226**   
##                                                                                                                                             (0.076)    
## -------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                          0.012           0.023           0.033         0.045         0.045         0.049         0.049     
##   adj. R-squared                                     0.012           0.023           0.032         0.040         0.040         0.042         0.043     
##   sigma                                              2.781           2.705           2.703         1.354         1.355         1.353         1.352     
##   F                                                 65.785         103.844          80.980         9.154         8.789         6.728         8.005     
##   p                                                  0.000           0.000           0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood                               -130299.604     -128676.091     -127081.722     -7726.351     -7726.136     -7717.351     -7715.844     
##   Deviance                                      412595.700      389914.575      384537.871      8198.306      8197.518      8165.527      8160.051     
##   AIC                                           260623.208      257380.181      254211.444     15502.703     15504.271     15506.703     15493.689     
##   BIC                                           260729.828      257504.555      254424.365     15662.960     15670.938     15737.473     15692.407     
##   N                                              53367           53307           52666          4493          4493          4493          4493         
## =======================================================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
The distribution of month across latitude and longitude is quite spread across

the US, although we do notice a higher prevalence of oberservations during the winter in southern states like Florida, and in summer in northern states like Alaska. This likely correspond to higher volumes of flights to/from these destinations in those periods of the year. In terms of latitude vs longitude by weather condition, The weather condition VMC seems to be quite consistent except for patches of primarily IMC conditions from certain discrete areas.

Were there any interesting or surprising interactions between features?

The relationship between Total Fatal Injuries vs Engine Type by Year was interesting as it also how certain engines have become prevaent in different time periods. For instance, Turbo-Jet and Turbo-Fan powered aircraft show a higher number of fatalities in later years whereas Reciprocated engies show a distribution of fatalities in earlier years.

OPTIONAL: Did you create any models with your dataset?

A liner model was created but only showed a low R-squared value of 0.049. Hence, the variables in the linear model account for 4.9% of the variance in the total number of fatalities.


Final Plots and Summary

Plot One

Description One

The relationship between Total Fatal Injuries vs Engine Type by Year was interesting as it also how certain engines have become prevaent in different time periods.

Plot Two

Description Two

This is a great plot as the observations in our dataset clearly cover the entirety of the US space, hence outlining map of the US one oberservation at the time. A sad chart however as it shows that every state suffered an aviation tragedy between 1948 and 2017.

Plot Three

Description Three

This is a great plot as it gives an indication of which phases of flight to be particularly worried about given a particular weather condition. For instance, if you know that the weather condition is VMC, watch out for the Take-off and Landing! ——

Reflection

I was particularly suprised by two three: (1) the amount of time it took to read about and understand each variable to make the best use of the given data; (2) the amount of computational power it took to process a dataset of a little less then 80,000 observations - a small sample size in our day & age; (3) and finally how much data wrangling it took to prep the data despite the fact that this is considered “clean data” after Kaggle already did the tidying.

I mostly got around the challenges I found while doing the analysis by referring to my classe notes and reading StackOverflow question anwers. Also, using an R Markdown file also quick trials & erros which is very useful when exploring.

The info from this study could potentially inform pilots to pay particular attention for certain combination of weather conditions and phases of flight. As demonstrated by the liner model outcomes, it is quite difficult to predict number of fatalities but more variables could have potentially increase the R-squared factor of the predictive linear model shown in this study.

A great project nevertheless. I really enjoyed it.